cone rod homeobox (CRX) mutations can cause dominant Leber congenital amaurosis (LCA), a genetic disorder that can lead to blindness. Normally LCA is caused by recessive genes, but on rarer cases, it can be cause by dominant genes, such as CRX. We are look at Bulk RNA-seq data from retinal organoids derived from induced pluripotent stem cells (iPSCs) which is itself take from a LCA patient with CRX-I138fs mutation and their healthy parent. This mutation occurs in the transactivation domain, and so the effect of the mutation ist that CRX is not transcribed. The data source is GSE152939. The author is trying to show that a deno-associated virus gene therapy can restore RNA expression of CRX. However, although this study is about gene therapy, the data that show rescued expression is in single cell RNA-seq data, which is not in the scope of this report, so it was not included. The data was collected from 6 retinal organoids, for 90 days, 125 days, 150 days and 200 day after the iPSCs start to differentiate; 3 of the organoids were from LCA patient, and 3 organoids were from their healthy parent. (Kruczek et al. 2021)
Note that the vast majority of code in this report are take from BCB420 lecture (Isserlin 2022c, 2022a, 2022d, 2022b) with minor modification to make it appropriate for my data.
Note 2: I cannot put bibtex inline citatation within the codechunk, so it is left at the end of the code chunk, and sometimes there is a figure, so the figure legend/caption may be directly followed by the inline citation
Lets us first do all the processing that was done in assignment 1
source("./assignment_functions.r")
supplementaryFiles <- downloadSupplementaryFiles("GSE152939")
CRX_ExperimentRawCount <- readFile(supplementaryFiles)
CRX_ExperimentRawCountFiltered <- filterOutLowCount(CRX_ExperimentRawCount)
normalized_counts <- normalizeTheCounts(CRX_ExperimentRawCountFiltered)
(Isserlin 2022c, 2022a; Davis and Meltzer 2007; Durinck et al. 2009; Chen, Lun, and Smyth 2016)
ensemblDataSet <- getEnsemblbiomart()
(Isserlin 2022c, 2022a; Durinck et al. 2009)
normalized_counts_annot <- mapTheData(CRX_ExperimentRawCountFiltered, normalized_counts, ensemblDataSet)
(Isserlin 2022c, 2022a; Durinck et al. 2009)
saveRDS(normalized_counts_annot, file = "normalized_counts_annot.rds")
# Apprently there is a dupliate emseblem id
normalized_count_data <- normalized_counts_annot[!duplicated(normalized_counts_annot[ , c("ensembl_gene_id")]),]
normalized_counts_annot[duplicated(normalized_counts_annot[ , c("ensembl_gene_id")]),]
normalized_counts_annot[which(normalized_counts_annot$ensembl_gene_id == 'ENSG00000254876'), ]
Table 1a: The duplicate gene that was removed.
normalized_counts_annot[which(normalized_counts_annot$ensembl_gene_id == 'ENSG00000254876'), ]
Table 1b: showing the two gene has the same ensembl gene id, but not exactly the same hgnc symbol
I find the one that is consider as duplicate, then find the one that has the same enmselble gene id as it. For some reason this ensembl gene id maps to 2 hgnc symbols, SUGT1P4-STRA6LP and STRA6LP, even from the name it look like the same gene
normalized_count_cleaned <- normalized_count_data[normalized_count_data$ensembl_gene_id != 'ENSG00000156508', ]
first make a heat map matrix (the input thing use to generate a heat map)
heatmap_matrix <- normalized_count_cleaned[,
3:ncol(normalized_count_cleaned)]
rownames(heatmap_matrix) <- normalized_count_cleaned$ensembl_gene_id
colnames(heatmap_matrix) <- colnames(normalized_count_cleaned[, 3:ncol(normalized_count_cleaned)])
heatmap_matrix
saveRDS(heatmap_matrix, file = "heatmap_list.rds")
Table 2: The first 1000 row of the filtered normalized data for all samples, ordered by ensembl id
heatmap_matrix <- readRDS("./heatmap_list.rds")
then generate a heat map
library(ComplexHeatmap)
library(circlize)
if(min(heatmap_matrix) == 0){
heatmap_col = colorRamp2(c(0, max(heatmap_matrix)), c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix), 0, max(heatmap_matrix)), c("blue", "white", "red"))
}
Heatmap(as.matrix(heatmap_matrix[1:1000,]),
show_row_dend = TRUE, show_column_dend = TRUE,
col=heatmap_col, show_column_names = TRUE,
show_row_names = FALSE, show_heatmap_legend = TRUE,
raster_resize_mat = TRUE, heatmap_legend_param = list(title = "Initial Heatmap"))
Figure 1: Initial heatmap for differential expression with the first 1000 genes, order by ensembl id. Only the first 1000 genes are shown due to rendering resource issues (Isserlin 2022d; Gu, Eils, and Schlesner 2016; Gu et al. 2014)
Now with row normalization (i.e. the kind with subtract the mean divided by standard deviation) we can see each gene better
heatmap_matrix <- t(scale(t(heatmap_matrix)))
if(min(heatmap_matrix) == 0){
heatmap_col = colorRamp2(c(0, max(heatmap_matrix)), c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix), 0, max(heatmap_matrix)), c("blue", "white", "red"))
}
Heatmap(as.matrix(heatmap_matrix[1:1000,]),
show_row_dend = TRUE,show_column_dend = TRUE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE, heatmap_legend_param = list(title = " Heatmap with row normalization"))
Figure 2: Heatmap for differential expression of the first 1000 genes, ordered by ensembl id, after row normalization. Only the first 1000 genes are shown due to rendering resource issues (Isserlin 2022d; Gu, Eils, and Schlesner 2016; Gu et al. 2014)
you can see that the samples are more similar to each other at the same time point than whether or not it is control, I think that is reasonable since many gene get simulated and repressed during development, so the genes that are on is different from one development time point, to another.
We will be fitting the data to the a linear model and using empirical bayes to calculate p-values. Then we will using Benjamni-Hochberg for multiple hypothesis testing, for correcting our p-values.
The simple model is where the model accounts for the condition (control vs experiment), and the day that the data was collected on, but not account for which organoids is the data collected from.
samples <- data.frame(lapply(colnames(CRX_ExperimentRawCountFiltered)[3:ncol(CRX_ExperimentRawCountFiltered)], FUN=function(x){unlist(strsplit(x, split = "\\_"))[c(1, 2, 3)]}))
samples
colnames(samples) <- colnames(CRX_ExperimentRawCountFiltered)[3:ncol(CRX_ExperimentRawCountFiltered)]
rownames(samples) <- c("condition", "time","patient")
samples <- data.frame(t(samples))
model_design <- model.matrix(~ samples$condition+samples$time)
expressionMatrix <- as.matrix(normalized_count_data[,3:ncol(CRX_ExperimentRawCountFiltered)])
rownames(expressionMatrix) <-
normalized_count_data$ensembl_gene_id
colnames(expressionMatrix) <-
colnames(normalized_count_data)[3:ncol(CRX_ExperimentRawCountFiltered)]
minimalSet <- Biobase::ExpressionSet(assayData=expressionMatrix)
fit <- limma::lmFit(minimalSet, model_design)
fit2 <- limma::eBayes(fit,trend=TRUE)
topfit <- limma::topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,1:2],
topfit,
by.y=0,by.x=1,
all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
Table 3: Showing for ecah sample, which organoid, at what time, and what condition it is
(Isserlin 2022d; Huber et al. 2015; Ritchie et al. 2015)
knitr::kable(output_hits[1:10,1:7],type="html",row.names = FALSE)
| ensembl_gene_id | hgnc_symbol | logFC | AveExpr | t | P.Value | adj.P.Val |
|---|---|---|---|---|---|---|
| ENSG00000225840 | -751.52636 | 278.308984 | -34.89691 | 0 | 0 | |
| ENSG00000243199 | -14.97061 | 6.890542 | -30.68555 | 0 | 0 | |
| ENSG00000280614 | -1251.17742 | 468.102238 | -21.35881 | 0 | 0 | |
| ENSG00000280800 | -1251.17742 | 468.102238 | -21.35881 | 0 | 0 | |
| ENSG00000281181 | -1251.17742 | 468.102238 | -21.35881 | 0 | 0 | |
| ENSG00000276043 | UHRF1 | 19.96635 | 8.419313 | 21.03885 | 0 | 0 |
| ENSG00000109991 | P2RX3 | 15.42168 | 4.861319 | 20.92939 | 0 | 0 |
| ENSG00000151615 | POU4F2 | 22.61264 | 6.640604 | 20.68397 | 0 | 0 |
| ENSG00000147432 | CHRNB3 | 8.65588 | 2.777748 | 20.59896 | 0 | 0 |
| ENSG00000076003 | MCM6 | 44.11041 | 47.070545 | 18.54065 | 0 | 0 |
Table 4: Show the limma output after fitting the data to the simple model, calculating the p-value, and correcting the p-value. This is only showing the first 10, ordered by p value. Note that the p value and adjusted p values are not actually 0, they are just rounded to 0
This is how many genes that have pvalue under 0.05
length(which(output_hits$P.Value < 0.05))
[1] 8178
This is how many genes that still pvalue of less than 0.05 after adjustment
length(which(output_hits$adj.P.Val < 0.05))
[1] 6174
The organoid model is where the model accounts for the condition (control vs experiment), and the day that the data was collected on, and for which organoids is the data collected from.
model_design_org <- model.matrix(
~ samples$patient + samples$condition + samples$time)
fit_pat <- limma::lmFit(minimalSet, model_design_org)
Change the model, fit the data to the model, then apply empircal bayes , then correct with BH
fit_org <- limma::lmFit(minimalSet, model_design_org)
fit2_org <- limma::eBayes(fit_pat,trend=TRUE)
topfit_org <- limma::topTable(fit2_pat,
coef=ncol(model_design_org),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits_org <- merge(normalized_count_data[,1:2],
topfit_org, by.y=0, by.x=1, all.y=TRUE)
#sort by pvalue
output_hits_org <- output_hits_org[order(output_hits_org$P.Value),]
(Isserlin 2022d; Ritchie et al. 2015)
knitr::kable(output_hits_org[1:10,1:7],type="html",row.names = FALSE)
| ensembl_gene_id | hgnc_symbol | logFC | AveExpr | t | P.Value | adj.P.Val |
|---|---|---|---|---|---|---|
| ENSG00000225840 | -751.52636 | 278.308984 | -34.77848 | 0 | 0 | |
| ENSG00000243199 | -14.97061 | 6.890542 | -29.37584 | 0 | 0 | |
| ENSG00000151615 | POU4F2 | 22.61264 | 6.640604 | 21.07068 | 0 | 0 |
| ENSG00000147432 | CHRNB3 | 8.65588 | 2.777748 | 21.01946 | 0 | 0 |
| ENSG00000276043 | UHRF1 | 19.96635 | 8.419313 | 20.97652 | 0 | 0 |
| ENSG00000109991 | P2RX3 | 15.42168 | 4.861319 | 20.54571 | 0 | 0 |
| ENSG00000280614 | -1251.17742 | 468.102238 | -20.49586 | 0 | 0 | |
| ENSG00000280800 | -1251.17742 | 468.102238 | -20.49586 | 0 | 0 | |
| ENSG00000281181 | -1251.17742 | 468.102238 | -20.49586 | 0 | 0 | |
| ENSG00000165646 | SLC18A2 | 25.24788 | 8.885919 | 19.54160 | 0 | 0 |
Table 5: Show the limma output after fitting the data to the organoid model, calculating the p-value, and correcting the p-value. This is only showing the first 10, ordered by p-value. Note that the p value and adjusted p values are not actually 0, they are just rounded to 0
length(which(output_hits_org$P.Value < 0.05))
[1] 7838
length(which(output_hits_org$adj.P.Val < 0.05))
[1] 5677
There is actually less, but let us look at it from a graph
simple_model_pvalues <- data.frame(ensembl_id =
output_hits$ensembl_gene_id,
simple_pvalue=output_hits$adj.P.Val)
org_model_pvalues <- data.frame(ensembl_id =
output_hits_org$ensembl_gene_id,
organoid_pvalue = output_hits_org$adj.P.Val)
two_models_pvalues <- merge(simple_model_pvalues,
org_model_pvalues,by.x=1,by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[
two_models_pvalues$simple_pvalue<0.05] <- "orange"
two_models_pvalues$colour[
two_models_pvalues$organoid_pvalue<0.05] <- "blue"
two_models_pvalues$colour[
two_models_pvalues$simple_pvalue<0.05 &
two_models_pvalues$organoid_pvalue<0.05] <- "red"
plot(two_models_pvalues$simple_pvalue,
two_models_pvalues$organoid_pvalue,
col = two_models_pvalues$colour,
xlab = "Simple model corrected p-values",
ylab ="Organoid model corrected p-values",
main="Simple vs Organoid Limma",
xlim = c(0, 0.1),
ylim = c(0, 0.1))
Figure 3: Showing each gene on a graph where the coordinates is the corrected p-values using the simple model and the organoid model in limma, zoomed in on 0 to 0.1. Highlighting significant genes in both model as red, in simple model only as orange, in organoid model only as red, and non-significant genes as black (Isserlin 2022d)
To determine which model is more appropriate, we take a look at p-value of the gene of interest, CRX.
ensembl_of_interest <- normalized_count_data$ensembl_gene_id[
which(normalized_count_data$hgnc_symbol == "CRX")]
two_models_pvalues$colour <- "grey"
two_models_pvalues$colour[two_models_pvalues$ensembl_id==
ensembl_of_interest] <- "red"
plot(two_models_pvalues$simple_pvalue,
two_models_pvalues$organoid_pvalue,
col = two_models_pvalues$colour,
xlab = "simple model corrected p-values",
ylab ="organoid model corrected p-values",
main="Simple vs Organoid Limma",
xlim = c(0, 0.1),
ylim = c(0, 0.1))
points(two_models_pvalues[which(
two_models_pvalues$ensembl_id == ensembl_of_interest),2:3],
pch=20, col="red", cex=1.5)
legend(0,1,legend=c("CRX","rest"),
fill=c("red","grey"),cex = 0.7)
Figure 4: Showing each gene on a graph where the coordinates is the corrected p-values using the simple model and the organoid model in limma, zoomed in on 0 to 0.1. Highlighting only the gene of interest, CRX as red (Isserlin 2022d)
The simple model is better, I believe that is the because there are minimal difference between the retinal organoids, since they are derived from iPSC that is from the same patient.
Let us look at the heatmap of only the top hits (by corrected pvalue), in the simple model
top_hits <- output_hits$ensembl_gene_id[
output_hits$adj.P.Val<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[
which(rownames(heatmap_matrix) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_atrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
Heatmap(as.matrix(heatmap_matrix_tophits[1:1000,]),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
heatmap_legend_param = list(title = "Heatmap from Limma")
)
Figure 5: Heatmap for differential expression of the first 1000 genes, ordered by ensembl id, after row normalization for significant genes by corrected p-value using the simple model in limma. Only the first 1000 genes are shown due to rendering resource issues (Isserlin 2022d)
Let us not cluster the genes but group controls and experiments together
top_hits <- output_hits$ensembl_gene_id[
output_hits$adj.P.Val<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[
which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits <- heatmap_matrix_tophits[,
c(
grep(colnames(heatmap_matrix_tophits), pattern = "\\CRXLCA"),
grep(colnames(heatmap_matrix_tophits), pattern = "\\Control")
)
]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
max(heatmap_matrix_tophits)),
c("blue", "white", "red"))
}
Heatmap(as.matrix(heatmap_matrix_tophits[1:1000,]),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
heatmap_legend_param = list(title = "Limma Heatmap without clustering"))
Figure 6: Heatmap for differential expression of the first 1000 genes, ordered by ensembl id, after row normalization for significant genes based on corrected p-value using the simple model in limma, without clustering, and group control and experiment together, respectively. Only the first 1000 genes are shown due to rendering resource issues (Isserlin 2022d)
We can clearly tell that the test condition do not look like each other. The controls also do not look like each other.
Then again look at top hits with less than 0.01 pvalue
top_hits <- output_hits$ensembl_gene_id[
output_hits$adj.P.Val < 0.01]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits <- heatmap_matrix_tophits[,
c(grep(colnames(heatmap_matrix_tophits),pattern = "\\CRXLCA"),
grep(colnames(heatmap_matrix_tophits),pattern = "\\Control"))]
nrow(heatmap_matrix_tophits)
[1] 3726
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
max(heatmap_matrix_tophits)),
c("blue", "white", "red"))
}
Heatmap(as.matrix(heatmap_matrix_tophits[1:1000,]),
cluster_rows = TRUE, show_row_dend = TRUE,
cluster_columns = FALSE,show_column_dend = FALSE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE, heatmap_legend_param = list(title = "Limma Heatmap, no cluster, 0.1"))
Figure 7: Heatmap for differential expression of the first 1000 genes, ordered by ensembl id, after row normalization for hits with corrected p-value smaller than 0.01 using the simple model in limma, without clustering, and group control and experiment together, respectively. Only the first 1000 genes are shown due to rendering resource issues (Isserlin 2022d; Gu, Eils, and Schlesner 2016; Gu et al. 2014)
The simple model is where the model accounts for the condition (control vs experiment), and the day that the data was collected on, but not account for which organoids is the data collected from.
filtered_data_matrix <- as.matrix(CRX_ExperimentRawCountFiltered[,3:28])
rownames(filtered_data_matrix) <- CRX_ExperimentRawCountFiltered$ens.id
d <- edgeR::DGEList(counts=filtered_data_matrix, group=samples$condition)
d <- edgeR::estimateDisp(d, model_design)
fit <- edgeR::glmQLFit(d, model_design)
(Isserlin 2022d; Chen, Lun, and Smyth 2016)
qlf.pos_vs_neg <- edgeR::glmQLFTest(fit, coef='samples$conditionCRXLCA')
knitr::kable(edgeR::topTags(qlf.pos_vs_neg), type="html",row.names = FALSE)
|
|
|
|
(Isserlin 2022d; Chen, Lun, and Smyth 2016; Xie 2021)
Even though this say FDR, but since we are using the same multiple hypothesis testing method for both edgeR and limma, they are the same type of corrected values. The first one is gene before correction, and second one is after correction
qlf_output_hits <- edgeR::topTags(qlf.pos_vs_neg,sort.by = "PValue",
n = nrow(normalized_count_data))
length(which(qlf_output_hits$table$PValue < 0.05))
[1] 6909
length(which(qlf_output_hits$table$FDR < 0.05))
[1] 4652
The organoid Model is where the model accounts for the condition (control vs experiment), and the day that the data was collected on, and for which organoids is the data collected from.
filtered_data_matrix <- as.matrix(CRX_ExperimentRawCountFiltered[,3:28])
rownames(filtered_data_matrix) <- CRX_ExperimentRawCountFiltered$ens.id
d <- edgeR::DGEList(counts=filtered_data_matrix, group=samples$condition)
d_org <- edgeR::estimateDisp(d, model_design_org)
fit_org <- edgeR::glmQLFit(d_org, model_design_org)
(Isserlin 2022d; Chen, Lun, and Smyth 2016)
qlf.pos_vs_neg <- edgeR::glmQLFTest(fit_org, coef='samples$conditionCRXLCA')
knitr::kable(edgeR::topTags(qlf.pos_vs_neg), type="html",row.names = FALSE)
|
|
|
|
(Isserlin 2022d; Xie 2021; Chen, Lun, and Smyth 2016)
The first one is gene before correction, and second one is after correction
qlf_output_hits_org <- edgeR::topTags(qlf.pos_vs_neg,sort.by = "PValue",
n = nrow(normalized_count_data))
length(which(qlf_output_hits_org$table$PValue < 0.05))
[1] 6556
length(which(qlf_output_hits_org$table$FDR < 0.05))
[1] 4222
simple_model_pvalues_qlf <- data.frame(ensembl_id =
rownames(qlf_output_hits$table),
simple_pvalue = qlf_output_hits$table$FDR)
org_model_pvalues_qlf <- data.frame(ensembl_id =
rownames(qlf_output_hits_org$table),
organoid_pvalue = qlf_output_hits_org$table$FDR)
two_models_pvalues <- merge(simple_model_pvalues_qlf,
org_model_pvalues_qlf, by.x=1, by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[
two_models_pvalues$simple_pvalue<0.05] <- "orange"
two_models_pvalues$colour[
two_models_pvalues$organoid_pvalue<0.05] <- "blue"
two_models_pvalues$colour[
two_models_pvalues$simple_pvalue<0.05 &
two_models_pvalues$organoid_pvalue<0.05] <- "red"
plot(two_models_pvalues$simple_pvalue,
two_models_pvalues$organoid_pvalue,
col = two_models_pvalues$colour,
xlab = "Simple model corrected p-values",
ylab ="Organoid model corrected p-values",
main="Simple vs Organoid edgeR",
xlim=c(0, 0.1),
ylim=c(0, 0.1))
Figure 8: Showing each gene on a graph where the coordinates is the corrected p-values using the simple model and the organoid model in glmQLFit, zoomed in on 0-0.1. Highlighting significant genes in both model as red, in simple model only as orange, in organoid model only as red, and non-significant genes as black (Isserlin 2022d)
The majority of genes that have less than 0.05 p-value in either model, has less than 0.05 p-value in both model.
To determine which model is more appropriate, we take a look at p=value of the gene of interest, CRX.
ensembl_of_interest <- normalized_count_data$ensembl_gene_id[
which(normalized_count_data$hgnc_symbol == "CRX")]
two_models_pvalues$colour <- "grey"
two_models_pvalues$colour[two_models_pvalues$ensembl_id==
ensembl_of_interest] <- "red"
plot(two_models_pvalues$simple_pvalue,
two_models_pvalues$organoid_pvalue,
col = two_models_pvalues$colour,
xlab = "simple model corrected p-values",
ylab ="organoid model corrected p-values",
main="Simple vs Organoid Limma",
xlim = c(0, 0.01),
ylim = c(0, 0.01))
points(two_models_pvalues[which(
two_models_pvalues$ensembl_id == ensembl_of_interest),2:3],
pch=20, col="red", cex=1.5)
legend(0,1,legend=c("CRX","rest"),
fill=c("red","grey"),cex = 0.7)
Figure 9: Showing each gene on a graph where the coordinates is the corrected p-values using the simple model and the organoid model in glmQLFit, zoomed in on 0-0.01. Highlighting only the gene of interest, CRX (Isserlin 2022d)
The simple model is better, I believe this is still due to the same reason that, there are minimal difference between the retinal organoids, since they are derived from iPSC that is from the same patient.
qlf_simple_model_pvalues <- data.frame(
ensembl_id = rownames(qlf_output_hits$table),
qlf_simple_pvalue = qlf_output_hits$table$FDR)
limma_simple_model_pvalues <- data.frame(
ensembl_id = output_hits$ensembl_gene_id,
limma_simple_pvalue = output_hits$adj.P.Val)
two_models_pvalues <- merge(qlf_pat_model_pvalues,
limma_pat_model_pvalues,
by.x=1,by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[two_models_pvalues$qlf_simple_pvalue
<0.05] <- "orange"
two_models_pvalues$colour[two_models_pvalues$limma_simple_pvalue
<0.05] <- "blue"
two_models_pvalues$colour[two_models_pvalues$qlf_simple_pvalue
<0.05 &
two_models_pvalues$limma_simple_pvalue<0.05] <- "red"
plot(two_models_pvalues$qlf_simple_pvalue,
two_models_pvalues$limma_simple_pvalue,
col = two_models_pvalues$colour,
xlab = "QLF simple model adjusted p-values",
ylab ="Limma simple model adjusted p-values",
main="QLF vs Limma",
xlim = c(0, 1),
ylim = c(0, 1))
Figure 10: Showing each gene on a graph where the coordinates is the corrected p-values using the simple model in glmQLFit and in limma. Highlighting significant genes in both model as red, in simple model only as orange, in organoid model only as red, and non-significant genes as black (Isserlin 2022d)
For some reason, this is all over the place
ensembl_of_interest <- normalized_count_data$ensembl_gene_id[
which(normalized_count_data$hgnc_symbol == "CRX")]
two_models_pvalues$colour <- "grey"
two_models_pvalues$colour[two_models_pvalues$ensembl_id
==ensembl_of_interest] <- "red"
plot(two_models_pvalues$qlf_simple_pvalue,
two_models_pvalues$limma_simple_pvalue,
col = two_models_pvalues$colour,
xlab = "QLF simple model adjusted p-values",
ylab ="Limma simple model adjusted p-values",
main="QLF vs Limma",
xlim = c(0, 0.02),
ylim = c(0, 0.02))
points(two_models_pvalues[
two_models_pvalues$ensembl_id==ensembl_of_interest,2:3],
pch=24, col="red", cex=1.5)
Figure 11: Showing each gene on a graph where the coordinates is the corrected p-values using the simple model in glmQLFit and in limma, zoomed in on 0-0.02. Highlighting only the gene of interest, CRX (Isserlin 2022d)
It looks like QLF is better than limma, and that does make sense, since limma was made for microarray, but I am using bulk RNA-seq data, which edgeR methods are designed for.
heatmap, but using tophits from qlf instead of limma
top_hits <- rownames(qlf_output_hits$table)[
qlf_output_hits$table$FDR<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix)
%in% top_hits),])))
nrow(heatmap_matrix_tophits)
[1] 4651
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
max(heatmap_matrix_tophits)),
c("blue", "white", "red"))
}
Heatmap(as.matrix(heatmap_matrix_tophits[1:1000,]),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
heatmap_legend_param = list(title = "glmQLFit Heatmap")
)
Figure 12: Heatmap for differential expression of the first 1000 genes, ordered by ensembl id, after row normalization for significant genes using the simple model in glmQLFit. Only the first 1000 genes are shown due to rendering resource issues (Isserlin 2022d; Gu, Eils, and Schlesner 2016; Gu et al. 2014)
compare to the heatmap from lemma, the day 200 do not cluster together, but the day 150 and day 125 cluster into 2 groups. However all day 90 do cluster together.
then also cluster by control vs experiment
top_hits <- rownames(qlf_output_hits$table)[qlf_output_hits$table$FDR
<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix)
%in% top_hits),])))
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
c(grep(colnames(heatmap_matrix_tophits),pattern = "\\CRXLCA"),
grep(colnames(heatmap_matrix_tophits),pattern = "\\Control"))]
nrow(heatmap_matrix_tophits)
[1] 4651
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
max(heatmap_matrix_tophits)),
c("blue", "white", "red"))
}
Heatmap(as.matrix(heatmap_matrix_tophits[1:1000,]),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
heatmap_legend_param = list(title = "glmQLFit Heatmap no cluster"))
Figure 13: Heatmap for differential expression of the first 1000 genes, ordered by ensembl id, after row normalization for significant genes using the simple model in glmQLFit, without clustering, and group control and experiment together, respectively. Only the first 1000 genes are shown due to rendering resource issues (Isserlin 2022d; Gu, Eils, and Schlesner 2016; Gu et al. 2014)
This is the number of genes upregulated and downregulated (I did check the model design, control was labeled as 0 and experiment/CRX was labeled as 1), so this does mean that that is upregulated and downregulated in experiment.
length(which(qlf_output_hits$table$FDR < 0.05
& qlf_output_hits$table$logFC > 0))
[1] 2374
length(which(qlf_output_hits$table$FDR < 0.05
& qlf_output_hits$table$logFC < 0))
[1] 2278
I only outputted the first 5000 genes, because g:profiler will crash with too many genes.
qlf_output_hits_withgn <- merge(CRX_ExperimentRawCountFiltered[,1:2], qlf_output_hits, by.x=1, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$FDR, base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
upregulated_genes <- qlf_output_hits_withgn$gene.id[
which(qlf_output_hits_withgn$FDR < 0.05
& qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$gene.id[
which(qlf_output_hits_withgn$FDR < 0.05
& qlf_output_hits_withgn$logFC < 0)]
first5000_genes <- qlf_output_hits_withgn$gene.id[c(1:5000)]
if (!file.exists("data")) {
dir.create("data")
}
write.table(x=upregulated_genes,
file=file.path("data","CRX_upregulated_genes.txt"), sep = "\t",
row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(x=downregulated_genes,
file=file.path("data","CRX_downregulated_genes.txt"),sep = "\t",
row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(x=first5000_genes,
file=file.path("data","CRX_ranked_genelist.txt"),
sep = "\t",
row.names = FALSE,
col.names = FALSE,
quote = FALSE)